AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?
Abstract
The article introduces AssistantBench, a benchmark for evaluating whether web agents can solve realistic and time-consuming tasks. Its tasks span diverse scenarios and domains and require agents to browse the web, identify relevant information, and synthesize an answer. The article also proposes a new web agent, SeePlanAct (SPA), which outperforms existing state-of-the-art web agents on the benchmark. AssistantBench proves challenging for current systems: no model reaches an accuracy of more than 25%. A detailed error analysis highlights the limitations of closed-book models, retrieval-augmented models, and web agents on these tasks.
Q&A
[01] Introduction
1. What are the key challenges for models that rely solely on parametric knowledge in assisting users with information-seeking tasks?
- Models that rely solely on parametric knowledge cannot access information on the web and are prone to hallucinations.
2. How can retrieving relevant evidence help, and what are the limitations of this approach?
- Retrieving relevant evidence can help, but the benefits are limited by the quality of the evidence retrieved. Retrieving irrelevant evidence can sometimes even hurt performance.
3. What is a more promising approach to assist users with time-consuming web tasks?
- A more promising approach is to simulate what humans do: an AI system searches the web for relevant pages, interacts with them, and synthesizes the information it gathers into an output.
[02] AssistantBench
1. What are the criteria for the tasks in AssistantBench?
- The tasks must be realistic, time-consuming, and automatically verifiable.
2. How was AssistantBench constructed?
- AssistantBench was constructed in three main steps: (a) creating a seed set of tasks, (b) expanding the dataset with crowd-workers, and (c) collecting tasks with domain experts.
3. What are the key statistics of the AssistantBench dataset?
- AssistantBench contains a total of 181 unique tasks, covering different users, domains, and websites. The tasks require interacting with many websites, and the answers are spread across webpages on these websites.
[03] SPA: See-Plan-Act
1. How does SPA differ from the state-of-the-art web agent, SeeAct?
- SPA is built upon SeeAct and is equipped with two specialized components: (1) a planning component for the model to plan and re-plan its execution, and (2) a memory component with the option to transfer information between steps.
2. What new actions does SPA have to support open-web navigation?
- SPA has new actions that enable (a) returning to a previous page, (b) navigating to a specified URL, or (c) entering a query directly into a search engine.
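As a rough illustration of the two components and three actions described above, the following Python sketch models them as a small state object. It is not SPA's actual implementation: the names AgentState, replan, remember, goto, go_back, and search, and the placeholder search URL, are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional
from urllib.parse import quote_plus

# Placeholder search endpoint (an assumption, not a real service).
SEARCH_URL = "https://example.com/search?q="


@dataclass
class AgentState:
    """Hypothetical state carried between steps of a SPA-like agent."""
    plan: str = ""                                    # current plan, may be rewritten
    memory: List[str] = field(default_factory=list)   # notes passed to later steps
    history: List[str] = field(default_factory=list)  # visited URLs, oldest first

    @property
    def current_url(self) -> Optional[str]:
        return self.history[-1] if self.history else None


def replan(state: AgentState, new_plan: str) -> None:
    """Planning component: replace the plan after observing a page."""
    state.plan = new_plan


def remember(state: AgentState, note: str) -> None:
    """Memory component: keep a finding available for later steps."""
    state.memory.append(note)


def goto(state: AgentState, url: str) -> None:
    """Action (b): navigate to a specified URL."""
    state.history.append(url)


def go_back(state: AgentState) -> None:
    """Action (a): return to the previous page, if there is one."""
    if len(state.history) > 1:
        state.history.pop()


def search(state: AgentState, query: str) -> None:
    """Action (c): enter a query directly into a search engine."""
    goto(state, SEARCH_URL + quote_plus(query))
```

In a real agent, each of these functions would be chosen by the model at every step and executed in a browser; here they only update the bookkeeping, which is enough to show how a plan and memory can persist across page transitions.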
[04] Experiments
1. What are the key findings from the experiments on AssistantBench?
- No system reaches an accuracy of more than 25% on AssistantBench. SPA outperforms SeeAct by 6.8 percentage points in answer rate and 9.4 percentage points in precision. An ensemble that combines SPA with a closed-book model achieves the best overall performance.
2. How do the models perform on the FanoutQA benchmark?
- Similar trends are observed on FanoutQA, with SPA outperforming SeeAct by 22.5 percentage points in answer rate and having a higher or similar precision relative to all other models.
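The metrics named above can be related with a short sketch. This summary does not define them, so the code assumes that answer rate is the fraction of tasks on which a system gives an answer, precision is the mean score over answered tasks, and accuracy is the mean score over all tasks with abstentions counted as zero. The fallback rule for the ensemble (use the closed-book answer only when the agent abstains) is likewise an assumption, and the function names are hypothetical.

```python
from typing import Dict, List, Optional


def summarize(scores: List[Optional[float]]) -> Dict[str, float]:
    """Aggregate per-task scores in [0, 1]; None marks an abstention."""
    n = len(scores)
    answered = [s for s in scores if s is not None]
    return {
        # Share of tasks where the system produced any answer.
        "answer_rate": len(answered) / n if n else 0.0,
        # Mean score over the tasks that were answered.
        "precision": sum(answered) / len(answered) if answered else 0.0,
        # Mean score over all tasks; abstentions contribute zero.
        "accuracy": sum(answered) / n if n else 0.0,
    }


def fallback_ensemble(
    agent_answers: List[Optional[str]],
    closed_book_answers: List[Optional[str]],
) -> List[Optional[str]]:
    """Keep the agent's answer; use the closed-book answer when it abstains."""
    return [
        a if a is not None else c
        for a, c in zip(agent_answers, closed_book_answers)
    ]


# Example: four tasks, two answered with scores 1.0 and 0.5.
print(summarize([1.0, None, 0.5, None]))
```

Under these assumed definitions, accuracy equals answer rate times precision, so a system can raise its accuracy either by answering more tasks or by being right more often on the tasks it does answer.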
[05] Analysis
1. When is AssistantBench particularly challenging?
- The expert-provided tasks are more challenging for closed-book models but easier for web agents than the general set. Tasks with very short or very long web interactions also tend to be more challenging for web agents.
2. What are the main causes of errors for the different model types?
- For web agents, the majority of errors are due to navigation errors. Closed-book models tend to hallucinate answers, while retrieval-augmented models often fail to retrieve relevant information.
3. How do popular chatbots like ChatGPT perform on AssistantBench tasks?
- ChatGPT also struggles with AssistantBench tasks. Most of its errors come from over-relying on search results, which leads to wrong answers, or from hallucinating non-factual information.